VirtuousPocketome: a computational tool for screening protein–ligand complexes to identify similar binding sites

Protein residues within binding pockets play a critical role in determining the range of ligands that can interact with a protein, influencing its structure and function. Identifying structural similarities in proteins offers valuable insights into their function and activation mechanisms, aiding in predicting protein–ligand interactions, anticipating off-target effects, and facilitating the development of therapeutic agents. Numerous computational methods assessing global or local similarity in protein cavities have emerged, but their utilization is impeded by complexity, impractical automation for amino acid pattern searches, and an inability to evaluate the dynamics of scrutinized protein–ligand systems. Here, we present a general, automatic and unbiased computational pipeline, named VirtuousPocketome, aimed at screening huge databases of proteins for similar binding pockets starting from an interested protein–ligand complex. We demonstrate the pipeline's potential by exploring a recently-solved human bitter taste receptor, i.e. the TAS2R46, complexed with strychnine. We pinpointed 145 proteins sharing similar binding sites compared to the analysed bitter taste receptor and the enrichment analysis highlighted the related biological processes, molecular functions and cellular components. This work represents the foundation for future studies aimed at understanding the effective role of tastants outside the gustatory system: this could pave the way towards the rationalization of the diet as a supplement to standard pharmacological treatments and the design of novel tastants-inspired compounds to target other proteins involved in specific diseases or disorders. The proposed pipeline is publicly accessible, can be applied to any protein–ligand complex, and could be expanded to screen any database of protein structures.

In the field of structural biology, it is widely recognized that there is a strong relationship between the threedimensional structure of a protein and its function 1 .The recognition and analysis of structural similarities in proteins can represent a valuable strategy to gain insights into protein functions.In particular, the comparison of protein binding sites represents a challenging area of interest in the field of biology and biochemistry to improve the understanding of protein-ligand interactions, predict off-target effects, facilitate the development of more selective and effective therapeutic agents, investigate drug repurposing strategies, and explore polypharmacological treatments 2 .In particular, drug repurposing, also known as drug repositioning or drug reprofiling, refers to the process of identifying new therapeutic uses for existing drugs that were originally developed for a different indication 3 .This approach offers several advantages, including potentially shorter development timelines, reduced costs, and a higher likelihood of success compared to developing entirely new compounds 4 .On the other hand, polypharmacology refers to the ability of a drug or a pharmacological agent to interact with multiple biological targets within an organism 5 .In other words, a polypharmacological drug can affect multiple pathways, receptors, or proteins simultaneously.This is in contrast to traditional drug development, where the focus is often on ligand of interest.The algorithm builds structural motifs of the target binding site of the query receptor-ligand complex after clustering a molecular dynamics trajectory and then searches for similar patterns inside a specific protein database using the ASSAM code 31 followed by additional ad-hoc filtering steps.We applied our computational pipeline to screen the currently solved human proteome for proteins that exhibit a highly similar local amino acid pattern to the one lining the strychnine binding site in the human TAS2R46 bitter taste receptor.The rationale of the work is to explore the taste transduction pathway with a proteomic perspective to elucidate the possible role of tastants beyond the mere taste perception and to investigate whether other classes of proteins have a conserved ability to recognize such ligands, with possible implications in nutrition, homeostasis, and disease.

Results
Details concerning convergence of the MD simulations and structural equilibrium of investigated molecular models are reported in the Supplementary Information (see also Figs.S1 and S2).The last 50 ns of each simulation replica were considered as structural equilibrium and were concatenated to obtain a final 150 ns-long trajectory representing the ensemble of protein conformations.These concatenated trajectories were used for the subsequent analysis and to search for similar binding pockets within the human proteome through the pipeline described herein.

Similarity search and multi-step filtering
The similarity search against the entire human proteome consisting of 58,972 structures (see the Database Curation section) resulted in a total of 6718 hits using the ASSAM code.The subsequent steps, i.e. the SASA and docking filtering steps, yielded a total of 1852 and 257 hits respectively (Fig. 2A).In detail, we adopted a SASA threshold of 0.75, thus preserving all protein hits having a SASA in the binding pocket of at least 75% of the SASA of the original query binding site; on the other hand, we chose a docking threshold equal to 0.1, thus keeping the hits whose docking score would not differ by more than the 10% from the docking score of the query protein/ ligand complex.The best hit in terms of docking score after the multi-step filtering process, namely PDB 3A4S, is represented in Fig. 2B, highlighting the correspondence between the original motifs in the bitter taste receptor and the relative matching residues in the identified protein.The pipeline automatically generated a report file (see 'Results.txt'file in the Supplementary Material) which includes the list of all the hits at the end of the screening process with additional information, such as PDB IDs with doi of the relative publication, protein classes, shared residues between hits and query, SASA and Docking Scores.The complete list of the 257 retrieved protein hits is reported in Table S2.

Functional enrichment and signaling pathway analyses
VirtuousPocketome retrieved 145 unique Uniprot IDs relative to the previously identified hits, meaning that multiple PDBs at the end of the multistep filtering process corresponded to the same protein.The DAVID software was then employed to analyse the Gene Ontology terms and the signalling pathways data as described in the "Methods" section.
The functional enrichment analysis revealed that the input genes were significantly enriched for a total of 16 GO terms in the Cellular Components category, 10 terms in the Biological Processes category and 14 terms in the Molecular Functions category, based on the corrected p-value.In addition, 0 KEGG and 0 Reactome pathways were found to be significantly enriched at the same p-value threshold.The best 5 GO terms for each of the above-mentioned categories are represented in Fig. 3, whereas all the significantly retrieved GO terms are represented in the Supplementary Information (Figs.S4 and S5).Regarding the Biological Processes (BP), most of the retrieved genes are related to metabolic processes (40.9%), including organic substance, cellular and nitrogen compound metabolic processes.Besides, the most represented Molecular Functions (MF) are related to the binding of different species, such as proteins, small compounds or ions, and to enzyme activity, such as transferase, hydrolase, and oxidoreductase.Finally, regarding the last analysed GO term, the most represented Cellular Components (CC) are cytoplasm and membrane (39.0%).

Discussion
Herein, we developed a novel computational pipeline, named VirtuousPocketome, to screen a desired database for proteins sharing similar binding sites with a specific protein of interest.VirtuousPocketome is based on four major steps: (i) the motifs creation step, which identifies the most important residues involved in the interaction of the protein-ligand complex under investigation; (ii) the similarity search step, in which the human solved proteome is screened for similar motifs compared to the ones retrieved in the previous step; (iii) the multi-step filtering step, which preserves only the protein hits with binding sites effectively accessible and with a certain docking affinity for the query ligand; (iv) functional enrichment and signalling pathway analyses, which identifies the most relevant cellular components, molecular functions, biological processes and signalling pathways related to the protein hits identified in the previous steps.VirtuousPocketome takes as input the molecular structure (and eventually also the molecular dynamics trajectory) of a protein-ligand complex and gives as output a list of PDB structures sharing similar binding sites.The developed protocol is automatic and unbiased and can be applied to any protein-ligand complex.
To evaluate the developed code and prove its potential, we here selected the human TAS2R46 bitter taste receptor bound to a bitter ligand, i.e. strychnine and we screened the entire human proteome to search for similar structural motifs.The choice for this molecular system was driven by the recent experimental determination of the TAS2R46-strychnine complex structure 71 , as well as the fact that strychnine is experimentally known to target not only TAS2R46 67 but also other bitter taste receptors, such as TAS2R10 72 , and even other proteins 73 .Moreover, bitter taste receptors have been widely investigated through computational molecular modelling in past years, ensuring enough data for comparison and evaluation of some of the results of the platform 68,72,[74][75][76][77] .Furthermore, the strong relationship between food intake and health status, including the regulation of homeostasis and metabolism, makes the chosen molecular machinery a particularly intriguing and relevant testbed for our computational screening pipeline to pinpoint possible secondary targets outside the gustatory system for food-related tastants.
The first step of the proposed pipeline, i.e. the motifs creation step, applied to the TAS2R46 bitter taste receptor bound to strychnine pinpointed three major motifs of residues mostly involved in the ligand binding.In particular, all three motifs shared a salt-bridge interaction with residue GLU265 7.39 and two hydrophobic interactions with residues TYR85 3.29 and TRP88 3.32 , whereas the first motif comprised also hydrophobic interactions with residues THR69 2.64 , PHE252 6.58 , and PHE261 7.35 and the third motif with VAL61 2.56 , and VAL249 6.55 (Fig. 1).Interestingly, some of these residues have already been suggested by previous literature to be important interactions for ligand binding of bitter taste receptors.In detail, GLU265 7.39 and TRP88 3.32 were demonstrated to be pivotal in TAS2R46 activation by strychnine 71 .Moreover, mutagenesis studies have reported the importance of residue GLU265 7.39 for for agonist responsiveness of TAS2R46 67 and the same position has been also linked to ligand binding for similar receptors, including hydroxytryptamine (5-HT) receptors 78,79 , adrenergic receptors 80,81 , purinergic receptors 82,83 , and cholecystokinin-B (CCK-B)/gastrin receptor 84 .Moreover, residue TRP88 3.32 is widely conserved among TAS2Rs and was found to be crucial in the activation of TAS2R43, TAS2R30 and TAS2R46 67,85 .The importance of residue in position 3.29 (TYR85 for TAS2R46) was confirmed also for TAS2R10, which shows a similar binding site to TAS2R46 and is also activated by strychnine 72 .These pieces of the literature confirmed the reliability of the motifs creation step of the proposed pipeline to pinpoint the most important and relevant residues involved in the ligand binding.It is worth mentioning that the possibility of analysing a molecular dynamics trajectory allows the identification of multiple motifs, three in the present case, one for each identified cluster centroid, ensuring a more exhaustive sampling of the protein-ligand interactions.
Starting from the identified motifs, similar amino acid patterns were searched in the entire currently solved human proteome, consisting of 58,972 structures.The similarity search using the ASSAM code resulted in 6718 hits.Then, the multi-step filtering reduced the number of detected sites to a total of 257 PDB structures, which www.nature.com/scientificreports/represent 0.44% of the original database and 3.83% of the structures in ASSAM code output (Fig. 2A).It is worth mentioning that in the final set of 257 hits notable entries include the potassium voltage-gated channel and acetylcholinesterase.The potassium voltage-gated channel is known to be modulated by strychnine (https:// pubch em.ncbi.nlm.nih.gov/ bioas say/ 588834).Although there is no experimental assay in the literature for the binding affinity between strychnine and acetylcholinesterase, strychnine is recognized to bind to the acetylcholine receptor, which binds acetylcholine similarly to the acetylcholinesterase 86,87 .The presented approach was revealed to be effective in discarding a high number of spurious proteins whose matching motifs were unlikely to bind strychnine, due to insufficient solvent exposure or low predicted affinity obtained from molecular docking.It is essential to highlight that while this pipeline can identify structures with amino acid patterns similar to a specified binding pocket, the actual ability of the ligand to interact with the identified proteins needs validation through experimental methods.This study is presented as a screening platform designed to pinpoint proteins that are more likely to be promising targets for the given ligand, with the understanding that experimental testing is crucial for confirmation.
The functional enrichment analysis allowed us to pinpoint the main biological processes, molecular functions, and cellular components related to the gene expressing the retrieved hit proteins at the end of the Virtuous-Pocketome pipeline (Fig. 3).Most of the biological processes highlighted are connected to metabolic processes, which seems an intriguing result considering the strong relationship between taste perception, food intake and metabolism.Indeed, these results might indicate that strychnine has not only the ability to activate the TAS2R46 bitter taste receptor and elicit the bitter taste sensation, but it should be also the potential trigger or modulator of proteins directly involved in the metabolic processes.Moreover, the protein hits are mainly involved in molecular functions related to the binding of proteins or other small compounds and are localized mainly in the cytoplasm and membrane.These results seem pretty reasonable considering that TAS2R46 is a transmembrane protein and a promiscuous bitter taste receptor, able to bind a wide spectrum of chemical compounds.In light of these results, it is worth mentioning that strychnine is known to bind glycine and acetylcholine receptors 73 , which are membrane proteins involved in transport and signalling functions.Therefore, the presented pipeline was able to detect similar binding sites in receptors with the same localization and biological function of known strychnine targets.In addition, strychnine is a rather promiscuous ligand with anti-plasmodial and anti-cancer activity and other yet-unresolved molecular targets 88 .
In conclusion, in the present work, we developed a novel, general and automatic pipeline for identifying proteins sharing similar binding pockets with a query receptor-ligand complex by screening the entire solved human proteome.The developed pipeline is also freely accessible at https:// virtu ous.isi.gr/#/ virtu ous-pocke tome and will be easily expanded in the future to other protein databases.We used the TAS2R46-strychnine complex as a testbed for the proposed method to investigate if other proteins except the taste receptors might share similar binding pockets for the recognition of tastants.The proposed methodology allows for a deep investigation of the TAS2R46-specific residues needed for the strychnine binding to pinpoint other human proteins sharing similar binding pockets and investigate the potential roles that this tastant could play in contexts beyond the gustatory system.The retrieved proteins could be further analyzed to predict whether the interaction with the compound of interest results in their activation or modulation.This will increase our understanding of possible secondary effects of tastants beyond the mere taste perception and their impact on the biological processes and molecular functions in which the retrieved hit proteins are involved.This approach can also assist the design of specific foods and ingredients to develop personalised treatments able to target desired proteins or receptors involved in specific processes or diseases with the ultimate goal of accessing the potential of the diet as a supplement to traditional pharmacological treatments.

Overall workflow
The proposed computational pipeline requires as mandatory inputs the coordinates (PDB file) of a protein-ligand complex and the chain label(s) uniquely identifying the receptor(s) and the ligand in the provided structure.In order to allow the ligand to accommodate in the binding site and to assess the dynamic nature of the protein-ligand binding, the user can specify a molecular dynamics trajectory (GROMACS xtc/trr/pdb file), thus considering several configurations of the investigated complex.Moreover, additional custom parameters can also be defined by experienced users if needed (details in the following).
At present, the code is limited to the screening of the entire human proteome, but it can be easily expanded to any PDB-like database.
The overall workflow of the algorithm designed in the present work is divided into four main steps, each of which will be described in detail in the following paragraphs: Motifs Creation, Similarity Search, Multi-step Filtering, and Enrichment and Signaling Pathway Analyses.
The main output consists of a Txt file collecting the PDBs of the identified proteins sharing similar accessible binding site(s) compared to the query receptor-ligand complex.Additionally, the code provides as output the list of unique UniProt IDs related to the retrieved proteins, plots summarising the main results (see the "Results" section), and visualisation states of the molecular system under investigation.
The overall workflow is represented in Fig. 4 in a flow chart representation.
For the implementation of this workflow, we mainly used GROMACS functionalities 89 , MDAnalysis modules 70 , and pdb-tools 90 .The ASSAM source code was kindly provided by Prof. M. Firdaus Raih from the Molecular Function Regulation Lab (http:// mfrlab.org).
Step 1: motifs creation Starting from the provided PDB input file and the MD simulation (if present) of the protein-ligand complex, a list of residues defining the binding site is retrieved based on the distance between the ligand and the receptor.The default distance threshold is set to 10 Å, but this can be overridden with a custom value provided by the user as an additional parameter.If the user only provides a single PDB file of the protein-ligand complex, a single binding site is defined according to the chosen distance threshold between the ligand and the protein.Conversely, if the user provides also the MD trajectory, the obtained subset of coordinates from each frame of the simulation identifying the ligand and the residues of the protein binding site are clusterized with the K-Means algorithms implemented in MDAnalysis, which is based on the distances between atoms positions 70 .The user can set the desired number of clusters (k), otherwise, the optimal number of clusters is retrieved using the silhouette method ranging from a minimum of 2 clusters up to a maximum of 12 clusters.The silhouette method, as provided by the built-in scikit-learn function (sklearn.metrics.silhouette_score), is an example of an evaluation metric to indicate if clusters are well-defined.This method computes the mean distance between a sample and all other points within the same class (a) and the mean distance between a sample and all other points in the nearest neighbouring cluster (b).The Silhouette Coefficient, s, for a single sample, is then calculated as reported in Eq. (1): Then, the total Silhouette Coefficient is calculated as the mean of the Silhouette Coefficient for each sample.A higher Silhouette Coefficient score is indicative of denser and well-separated clusters, aligning with the conventional understanding of a cluster.It suggests that the samples within each cluster are tightly packed and distinct from samples in other clusters.This reflects a higher degree of cohesion within clusters and a greater separation between them.Therefore, the algorithm calculates the silhouette coefficient for different numbers of clusters (from 2 to 12) and then chooses the optimal number of clusters according to the best-achieved silhouette coefficient score.The subsequent centroids of the clusters are saved.
Starting from the previously defined binding sites, the subset of residues forming the binding site is further refined, highlighting those residues involved in non-covalent interactions with the ligand.These interacting residues are retrieved using the Protein-Ligand Interaction Profiler (PLIP) software 91 , which detects hydrogen bonds, hydrophobic contacts, pi-stacking, pi-cation interactions, salt bridges, water bridges, metal complexes, and halogen bonds between ligands and targets.This two-step analysis allows the definition of the most relevant protein residues for the interaction with the ligand.This subset of residues will be defined as motifs in the following.If only the PDB file is passed by the user, a single motif will be created; if an MD trajectory is provided, a motif will be produced for each centroid of the binding site.
Step 2: similarity search In the second step, the similarity search for each extracted motif is carried out using the ASSAM software.The fundamental principles underlying the search methodologies of ASSAM are described in previous literature 31,92 .In brief, the protein structure is represented as a graph, where nodes represent individual amino acid side chains and the geometric relationships between nodes form the edges of the graph.Each node is composed of two pseudo-atoms, which generate vectors that correspond to the nodes in the graph.The positions of these  www.nature.com/scientificreports/pseudo-atoms are strategically chosen to emphasize the functional portion of the corresponding side chain.The geometric relationships between pairs of residues are defined by distances calculated between their corresponding vectors, and these relationships are represented as the edges of the graph.If we let S, M and E denote the start, middle and end of a vector, the edges of the graph encompass five components: SS, SE, ES, EE, and MM distances, although only a subset of these distances is typically used to specify a query pattern.ASSAM employs a maximal common subgraph (MCS) approach using the Bron and Kerbosch MCS algorithm to enumerate all possible correspondences with similar protein patterns 93 .The choice to utilize ASSAM was made after an extensive examination of tools documented in the literature for assessing protein binding pockets (see also Table 1).Specifically, we scrutinized various methods, emphasizing those deemed most suitable for our objectives and compatible with integration into our pipeline.We specifically looked for methods that could compare a particular binding site with a broad set of proteins in a database (customizable if necessary), be executable locally rather than solely as a web server, provide output data on similar residues between the query protein and the target protein, perform searches quickly, and not require the prior definition of binding sites for proteins in the search database.This analysis pointed ASSAM as the most appropriate for the present work.In particular, ASSAM was chosen for (i) the simplicity in defining the binding site to be searched, i.e. a coordinate file containing the residues of interest in PDB format, (ii) the possibility to screen against any desired database in PDB format, (iii) the comparably fast time to solution, and (iv) the possibility of considering and preserving both right-handed and left-handed orientations of the alphahelices to retrieve the superpositions.A left-handed α-helical bundle superimposed onto a right-handed one is not equivalent, particularly concerning the handedness of the two groupings of amino acids.However, when it comes to chemical activity, two groupings of amino acids may exhibit the same behaviour despite having different handedness.The crucial factor, in this case, is the distance between the individual residues 31 .Thanks to the source code kindly provided by the ASSAM developers, the proposed pipeline does not rely on the ASSAM web-based analysis tools and runs entirely offline on-premises.
The output of the ASSAM search is formatted as a list, in which each row corresponds to a hit protein found in the screened protein database, along with its PDB accession ID, which presents a match with the residues of the input motif, also referred to as query.The matching residues between the query and the hit are also indicated, as well as the root-mean-square deviation (RMSD) value between query residues and hit residues after alignment.Finally, additional pieces of information are retrieved, namely the number of the initial conformation from which the query motif was created, and further information on the hit protein obtained from the RCSB Protein Data Bank site, such as the DOI of the corresponding publication and the EC classification.

Step 3: multi-step filtering
To further refine the output from the previous steps and select only the protein hits with accessible and highaffinity binding sites, two additional filtering steps, i.e. (i) the SASA and (ii) the docking filters, have been implemented.
The first step involves the calculation of the Solvent-Accessible Surface Area (SASA) using GROMACS 89 .In detail, for each hit protein, the SASA, evaluated in nm 2 , is calculated by measuring the value for all the residues matching the residues of the query protein's motifs.Values equal to zero indicate that the cleft identified by the residues is not solvent-exposed, but is rather buried in the structure of the protein and therefore likely inaccessible for the solvent or the ligand.Values that are greater than zero, on the other hand, indicate that the cleft formed by the hit residues is on the surface of the corresponding protein and therefore might allow for ligand binding.Only hits with the binding site having SASA greater than a predefined threshold of the SASA value of the corresponding query motif are retained.The SASA criterion is summarised by Eq. ( 2): where SASA HIT corresponds to the calculated SASA value of the given hit motif, SASA query is the SASA value of the corresponding motif of the query protein-ligand complex, and SASA threshold is the threshold to select only hits with the desired SASA.SASA threshold is set to 75% by default, but the user can specify a custom value in the additional input parameters.The selection of this threshold was imposed as a compromise to obtain a reasonable number of protein hits that could be effectively and rationally considered while retaining targets with motifs that are solvent-exposed and prone to ligand docking.The decision to set as default a high SASA threshold is based on the importance of solvent-accessible surface area (SASA) in assessing the capability of a protein binding site to accommodate ligands.Considering only SASA values considerably lower than the original protein binding site could compromise the algorithm's ability to retain only those binding sites most similar to the reference complex and thus more prone to ligand binding.Setting the SASA threshold to higher values, on the other hand, may lead to the exclusion of potentially promising binding sites and protein hits.As a result, users have the flexibility to adjust this threshold based on their screening objectives.
The second filtering step relies on the molecular docking of the original ligand in the query complex onto the retrieved hits from the previous steps.The docking procedure was implemented using SPORES and PLANTS 94,95 .PLANTS software was chosen since it exhibited excellent performance in terms of pose prediction and time to solution compared to other molecular docking software 96,97 and since it has been already successfully used for molecular docking and virtual screening campaigns on GPCRs [98][99][100][101][102] .These qualities make PLANTS particularly well-suited for the present work.Only hits with binding sites exhibiting a docking score (DSCORE) similar to or below the docking score of the original protein-ligand complex are retained.In particular, the docking score of the selected hit ( DSCORE HIT ) is compared with the docking score on the original query complex ( DSCORE query ) adjusted by a defined percentage using the specified docking threshold ( DSCORE threshold ).The docking threshold is set by default as 10% of the original protein-ligand complex docking score, if only a PDB is provided as input, (2) SASA HIT > SASA query * SASA threshold or the 10% of the average PLANTS docking scores between the centroids of the original protein-ligand complex, if the MD trajectory was specified.The docking filtering criterion is therefore defined by Eq. (3): Step 4: enrichment and signalling pathway analyses In this step, functional enrichment and signalling pathway analyses were performed to collect information regarding roles, functions, distributions, expressions and pathways in which the identified targets from the previous steps are involved.The pipeline automatically retrieves the unique UniProt IDs of the protein hits (since several PDB codes can correspond to the same protein) and searches for related genes using the DAVID functional annotation program 103,104 .We decided to focus our attention on the Gene Ontology (GO) terms 105 related to the Cellular Components (CC), Biological Processes (BP) and Molecular Functions (MF).The resulting GO terms were used to identify significantly enriched functional categories, using a statistical approach that takes into account the size of the protein list, the number of genes associated with each GO term, and the background gene set.We also performed pathway analysis using the Kyoto Encyclopedia of Genes and Genomes (KEGG) and the Reactome databases to identify the most represented pathways [106][107][108] .GO terms and KEGG pathways with corrected for multiple testing q-value < 0.1 were considered to be significantly enriched.The correction of the p-values and the calculation of the q-values were performed using the Benjamini-Hochberg FDR adjustment method 109 .The pipeline automatically generates separate plots with the above-mentioned analysis.

Database curation
The developed tool can screen a database of target proteins in the PDB format for binding pocket similarities against a query protein-ligand complex.For the present work, we have collected and refined all the experimentally solved PDB structures belonging to the human proteome.We first retrieved all the solved PDB codes belonging to homo sapiens from the NCBI database 110 using a dedicated Python API (Bio.Entrez package), and obtained a total of 60,159 entries (1st February 2023).We downloaded all the found queries from the RCSB database (http:// www.rcsb.org/) 111 , reaching a total of 59,267 structures (892 PDBs were not available).Then, the database was cleaned by removing (i) the chains in the PDBs not belonging to the human organism (such as in the case of protein chimaeras), (ii) multiple models from each PDB, (iii) ANISOU and HETATM lines, (iv) alternative locations of the atoms in the PDB by preserving the ones with the highest occupancy.At the end of this cleaning protocol, we ended up with 58,972 PDB files.

System setup
The previously described workflow was applied to search for proteins sharing similar binding pockets with a human bitter taste receptor.We employed the recently-solved TAS2R46 human bitter taste receptor bound with an agonist bitter compound, named strychnine (PDB ID: 7XP6) 71 .We first removed undesired molecules and structures from the PDB, preserving only the bitter taste receptor and strychnine.Since the receptor structure presents some missing residues (157-172), we downloaded the relative model (sequence P59540 of Homo Sapiens hTAS2R46) from the AlphaFold Protein Structure Database 112 .The AlphaFold model was then aligned to the receptor in the 7XP6 PDB.Missing residues in the original experimental structure were then built by homology modelling with MOE (Molecular Operating Environment) software 113 using the relative portion of the AlphaFold model as a template, thus filling the gap in the original experimental structure.Then, the structure of strychnine was refined using MOE, assigning the correct protonation at neutral pH and salt concentration of 0.15 M (see also Fig. S6).As reported in previous literature 71 , the tertiary amine of the molecule is protonated and its total charge is + 1, as also reported in the DrugBank database (https:// go.drugb ank.com/ drugs/ DB159 54).
The obtained complex was embedded into a POPC membrane using the CHARMM-GUI web server 114 .The receptor was aligned with the membrane bilayer using the OPM web server included in the CHARMM-GUI preparation protocol.The final number of POPC lipids (165) was set by a trial-and-error approach to creating a box large enough to meet the minimum image convention for the protein during the MD simulation: the final box size was set to 8.36 × 8.36 × 11.47 nm, with the z direction perpendicular to the cell membrane plane.The system box was filled with water and neutralized using NA + and CL -ions at 0.15 M physiological concentration.
We used the AMBER forcefield to describe the molecular system: in detail, we used the Lipid-21 forcefield 115 for lipids, the AMBER19SB 116 for the protein, ions and water and the General Amber Force Field (GAFF2) forcefield 117 to obtain the topology for strychnine, as implemented directly in the CHARMM-GUI suite.

Simulation protocol
The simulation workflow suggested by CHARMM-GUI was followed during minimisation, equilibration with position restraints and final simulation production.First, the system was energy-minimized using the steepest descent algorithm for 5000 steps.Then, six equilibration steps were performed gradually reducing the position restraints on the lipids and protein-heavy atoms (from 1000 to 0 kJmol −1 nm −1 for lipids, from 4000 to 50 kJmol −1 nm −1 for the protein backbone and from 2000 to 0 kJmol −1 nm −1 for protein side-chain heavy atoms).The system was equilibrated in the NVT ensemble for 250 ps with a conservative timestep of 1 fs, using the Berendsen thermostat 118 with a coupling time constant of 1 ps and a reference temperature of 303.15 K, which is above the phase-transition temperature for POPC, and subsequently in the NPT ensemble for 125 ps with the same 1 fs timestep, followed by a further simulation of 375 ps with a 2 fs timestep, using the Berendsen thermostat

Figure 1 .
Figure 1.Main interactions defining the three motifs for the bitter taste receptor interacting with strychnine.(A) Representative snapshot of the bitter-ligand complex, (B) PLIP interaction analysis identifying Hydrophobic Interaction (HI) and Saltbridge Interaction (SB), (C, D, E) site views of the three motifs identified.The bitter taste receptor is represented in grey, the strychnine in blue and the interacting residues in green (hydrophobic interactions) and purple (salt bridges).

Figure 2 .
Figure 2. (A) Number of total structures in the original human proteome database and the number of selected hits from the similarity search and the subsequent multi-step filtering.(B) Binding site view of the strychnine bound to the best hit (PDB: 3A4S) according to the docking score at the end of the multi-step filtering process.Protein is rendered in grey, strychnine in blue, residues in the original motif of the bitter taste receptor in red and matching residues in the 3A4S structure in violet.

Figure 3 .
Figure 3. Bar plots representing the best 5 retrieved GO terms for each category in the third level of the GO hierarchy relative to Biological Processes (BP), Molecular Functions (MF) and Cellular Components (CC).

Figure 4 .
Figure 4. Flow Chart of the overall workflow of VirtuousPocketome.